
Incident Management: Building Effective On-Call and Response Practices

Introduction

Every production system will experience incidents. The difference between organizations that recover quickly and those that struggle lies in their incident management practices. This guide covers building effective incident response processes, on-call rotations, and creating a culture that learns from failures.

Incident management is the discipline of responding to and recovering from service disruptions efficiently and effectively.

Incident Lifecycle

Phases

┌──────────────────────────────────────────────────────────────┐
│                      Incident Lifecycle                      │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│   ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐   │
│   │ Detect  │───▶│ Respond │───▶│ Resolve │───▶│ Review  │   │
│   └─────────┘    └─────────┘    └─────────┘    └─────────┘   │
│        │              │              │              │        │
│        ▼              ▼              ▼              ▼        │
│     Alerting      Escalation      Recovery       Learning    │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Detection

# Prometheus alerting rules
groups:
  - name: payment-service
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(payment_requests_total{status=~"5.."}[5m])) 
          / sum(rate(payment_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on payment service"
          description: "Error rate is {{ $value | humanizePercentage }}"
          
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, 
            rate(payment_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on payment service"

On-Call Practices

Building Rotations

# On-call rotation configuration
oncall_config = {
    "primary": {
        "engineer": "[email protected]",
        "phone": "+1234567890",
        "hours": "Monday 9am - Monday 9am"
    },
    "secondary": {
        "engineer": "[email protected]",
        "hours": "Always available for escalation"
    },
    "rotation": {
        "type": "weekly",
        "start": "Monday 9am UTC",
        "handoff_meeting": "Friday 3pm"
    }
}
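
With a weekly schedule like this, the current primary can be computed rather than looked up by hand. A minimal sketch, assuming a hypothetical roster and the Monday 9am UTC handoff configured above:

```python
# Minimal sketch: resolve the current primary on-call from a weekly rotation
# that hands off every Monday at 09:00 UTC. The roster is hypothetical.
from datetime import datetime, timedelta, timezone

ENGINEERS = ["[email protected]", "[email protected]", "[email protected]"]  # hypothetical roster
HANDOFF_WEEKDAY, HANDOFF_HOUR = 0, 9  # Monday, 09:00 UTC

def current_primary(now: datetime | None = None) -> str:
    now = now or datetime.now(timezone.utc)
    # Find the most recent Monday 09:00 UTC handoff.
    days_back = (now.weekday() - HANDOFF_WEEKDAY) % 7
    handoff = (now - timedelta(days=days_back)).replace(
        hour=HANDOFF_HOUR, minute=0, second=0, microsecond=0
    )
    if handoff > now:  # early Monday morning still belongs to last week's shift
        handoff -= timedelta(days=7)
    # Handoffs are exactly one week apart, so this index advances once per shift.
    week_index = int(handoff.timestamp()) // (7 * 24 * 3600)
    return ENGINEERS[week_index % len(ENGINEERS)]

print(current_primary())
```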

Escalation Policy

# Escalation tiers
escalation_policy:
  - level: 1
    name: "On-Call Engineer"
    response_time: "15 minutes"
    contacts:
      - phone: oncall_phone
      - slack: "#oncall"
      
  - level: 2
    name: "Team Lead"
    response_time: "30 minutes"
    contacts:
      - phone: team_lead_phone
      - slack: "#engineering"
      
  - level: 3
    name: "Engineering Manager"
    response_time: "1 hour"
    contacts:
      - phone: manager_phone
      - slack: "#leadership"

  - level: 4
    name: "VP Engineering"
    response_time: "2 hours"
    contacts:
      - phone: vp_phone
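
In a paging tool this policy becomes a loop: page a tier, wait out its response window, and move on if nobody acknowledges. A minimal sketch, where page() and is_acknowledged() are hypothetical stand-ins for your paging and incident-tracking integrations:

```python
# Minimal sketch: walk the escalation tiers until someone acknowledges.
# page() and is_acknowledged() are hypothetical stand-ins; the response
# windows mirror the policy above.
import time

ESCALATION_TIERS = [
    {"name": "On-Call Engineer", "response_time_min": 15},
    {"name": "Team Lead", "response_time_min": 30},
    {"name": "Engineering Manager", "response_time_min": 60},
    {"name": "VP Engineering", "response_time_min": 120},
]

def page(tier: dict, incident_id: str) -> None:
    print(f"paging {tier['name']} about {incident_id}")  # hypothetical notification hook

def is_acknowledged(incident_id: str) -> bool:
    return False  # hypothetical: ask your incident tracker

def escalate(incident_id: str, poll_interval_s: int = 30) -> None:
    for tier in ESCALATION_TIERS:
        page(tier, incident_id)
        deadline = time.monotonic() + tier["response_time_min"] * 60
        while time.monotonic() < deadline:
            if is_acknowledged(incident_id):
                return
            time.sleep(poll_interval_s)
    print(f"{incident_id} unacknowledged by every tier; treat as a major incident")
```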

Runbooks

# Runbook: Payment Service High Error Rate

## Symptoms
- Alert: HighErrorRate on payment-service
- Error rate > 5%

## Impact
- Users cannot complete payments
- Immediate revenue impact

## Steps

### 1. Check Service Health
kubectl get pods -n payments
kubectl describe deployment payment-service

### 2. Check Recent Deployments

kubectl rollout history deployment payment-service
git log --oneline -10 --since="1 hour ago"

### 3. Check Logs

kubectl logs -n payments -l app=payment-service --tail=100

### 4. Common Fixes

If recent deployment:

kubectl rollout undo deployment/payment-service

If database issue:

kubectl exec -it payment-db-0 -- psql -c "SELECT * FROM pg_stat_activity"

If third-party outage:

- Check payment provider status page
- Enable circuit breaker (see the sketch after this runbook)

### 5. Escalate if:

- Issue persists > 30 minutes
- Database investigation is needed
- A rollback is required

## Contact

- Primary: @oncall
- Secondary: @payments-team-lead
- Emergency: @engineering-manager
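
Step 4 of the runbook suggests enabling a circuit breaker when the payment provider itself is down. A minimal sketch of that pattern, with a hypothetical call_provider() standing in for the real payment client:

```python
# Minimal sketch of a circuit breaker around a third-party payment call.
# call_provider() and the thresholds are illustrative assumptions.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        # While open, fail fast until the reset timeout has elapsed.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: skipping provider call")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

breaker = CircuitBreaker()

def call_provider(amount_cents: int) -> dict:  # hypothetical provider client
    raise TimeoutError("provider unavailable")

# Usage: route every outbound provider call through the breaker, e.g.
# breaker.call(call_provider, 1299)
```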

Incident Communication

Status Page

# Automated status page updates
status_page:
  components:
    - name: "Payment Processing"
      status: "operational"  # operational, degraded, outage
    - name: "User Authentication"
      status: "operational"
      
  incidents:
    - id: "inc-123"
      status: "investigating"
      title: "Payment Processing Issues"
      body: "We are currently investigating reports of payment failures"
      updates:
        - status: "identified"
          body: "Root cause identified as database connection issue"
          created_at: "2026-03-12T10:30:00Z"
        - status: "resolved"
          body: "Issue has been resolved"
          created_at: "2026-03-12T11:00:00Z"

Incident Comms Template

# Incident Communication Template

## Initial Alert (within 15 min)
**Subject:** [INCIDENT] Payment Service Outage - Investigating

**Status:** 🔴 Investigating

**Impact:** Users unable to complete purchases

**What we're doing:** Looking into elevated error rates

**Next update:** In 30 minutes

---

## Update (every 30 min)

**Subject:** [INCIDENT] Payment Service Outage - Update #2

**Status:** 🟡 Identified

**Root cause:** Database connection pool exhausted due to connection leak

**What we're doing:** 
- Rolling back to previous version
- Clearing stuck connections

**ETA:** 15 minutes

**Next update:** In 15 minutes

---

## Resolution

**Subject:** [INCIDENT] Payment Service - Resolved

**Status:** 🟢 Resolved

**Summary:** Deployed fix for connection leak, service recovering

**Root cause:** Connection leak in payment processor library

**Action items:**
- [ ] Monitor connection pool metrics
- [ ] Add circuit breaker
- [ ] Schedule post-mortem

**Learnings:** Post-mortem to follow within 48 hours

Post-Incident Reviews

Blameless Post-Mortem Template

# Post-Mortem: Payment Service Outage
**Date:** March 12, 2026
**Duration:** 45 minutes
**Severity:** SEV-1

## Summary
Payment service experienced complete outage for 45 minutes affecting all transactions.

## Timeline (UTC)
- 10:15 - Alert fired: High error rate
- 10:17 - On-call acknowledged
- 10:22 - Root cause identified: Connection pool exhausted
- 10:35 - Rollback initiated
- 10:45 - Service recovered
- 11:00 - All systems operational

## Root Cause
Connection leak in payment processor library introduced in v2.3.0. 
Connections were not being returned to pool after successful transactions.

## Impact
- 2,347 failed transactions
- ~$50,000 in estimated lost revenue
- 45 minutes of degraded service

## What Went Well
- Alert fired quickly (< 2 min from issue start)
- Team responded promptly
- Rollback worked correctly

## What Could Be Improved
- Add connection pool monitoring before deployment
- Better staging environment parity
- Faster escalation path for DB issues

## Action Items
| Task | Owner | Due |
|------|-------|-----|
| Add connection_pool_available SLI | @alice | Mar 15 |
| Add circuit breaker to payment calls | @bob | Mar 20 |
| Improve staging DB load testing | @carol | Apr 1 |
| Review all library upgrades | @team | Ongoing |

Metrics and Monitoring

Key Metrics

| Metric | Target | Why |
|--------|--------|-----|
| MTTR (mean time to recovery) | < 30 min | Minimize impact |
| MTTD (mean time to detect) | < 5 min | Detect fast |
| MTBF (mean time between failures) | > 7 days | System reliability |
| False positive rate | < 5% | Trust in alerts |
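
Both MTTD and MTTR fall out directly from incident timestamps. A minimal sketch, assuming each incident record carries started_at, detected_at, and resolved_at (the sample record is illustrative):

```python
# Minimal sketch: compute MTTD and MTTR from incident records. MTTD is measured
# from incident start to detection, MTTR from start to resolution.
from datetime import datetime

incidents = [
    {
        "started_at": datetime(2026, 3, 12, 10, 13),
        "detected_at": datetime(2026, 3, 12, 10, 15),
        "resolved_at": datetime(2026, 3, 12, 10, 45),
    },
    # ... one entry per incident in the reporting window
]

def mean_minutes(deltas) -> float:
    deltas = list(deltas)
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

mttd = mean_minutes(i["detected_at"] - i["started_at"] for i in incidents)
mttr = mean_minutes(i["resolved_at"] - i["started_at"] for i in incidents)
print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")
```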

Alert Quality

# Alert effectiveness metrics
alert_metrics = {
    "total_alerts": 100,
    "actionable_alerts": 85,
    "false_positives": 5,
    "noise_rate": "5%",
    "time_to_acknowledge_avg": "3 min",
    "time_to_resolve_avg": "15 min"
}
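
Rather than maintaining that summary by hand, the same numbers can be derived from a log of alert outcomes. A minimal sketch, assuming your team tags each alert after the fact with an outcome label such as "actionable" or "false_positive":

```python
# Minimal sketch: derive alert-quality numbers from tagged alert records.
# The records and outcome labels are illustrative assumptions about how
# alerts get reviewed after the fact.
from collections import Counter

alerts = [
    {"name": "HighErrorRate", "outcome": "actionable"},
    {"name": "HighLatency", "outcome": "false_positive"},
    {"name": "HighErrorRate", "outcome": "actionable"},
]

outcomes = Counter(a["outcome"] for a in alerts)
total = len(alerts)
print(f"total alerts:    {total}")
print(f"actionable:      {outcomes['actionable']} ({outcomes['actionable'] / total:.0%})")
print(f"false positives: {outcomes['false_positive']} ({outcomes['false_positive'] / total:.0%})")
```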

Best Practices

  1. Automate detection: Don't rely on user reports
  2. Respond quickly: Acknowledge within 15 minutes
  3. Communicate transparently: Keep stakeholders informed
  4. Document everything: Capture timeline and actions
  5. Review blamelessly: Focus on systems, not people
  6. Continuously improve: Act on learnings

Conclusion

Effective incident management minimizes impact and builds team confidence. By investing in detection, response processes, and learning from failures, organizations can continuously improve their reliability and customer trust.
