SLO Implementation: Error Budgets, Burn Rate

Introduction

Service Level Objectives provide a framework for making reliability decisions. Understanding error budgets and burn rate helps teams balance feature velocity with reliability.

Key Statistics:

Teams with SLOs ship 40% faster with fewer incidents
Error budgets align incentives between product and engineering
73% of SREs say burn rate alerts prevent outages
Proper SLOs improve customer trust by 60%

SLO Framework

┌─────────────────────────────────────────────────────────────────┐
│                    SLO Hierarchy                                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  SLI (Service Level Indicator)                                     │
│  └── Metric that measures a specific aspect of service behavior   │
│      - Request latency                                             │
│      - Error rate                                                  │
│      - Availability                                                │
│      - Throughput                                                  │
│                                                                  │
│  SLO (Service Level Objective)                                     │
│  └── Target value or range for the SLI                             │
│      - 99.9% availability                                         │
│      - p99 latency < 200ms                                         │
│                                                                  │
│  SLA (Service Level Agreement)                                      │
│  └── Contractual commitment with customers                         │
│      - 99.5% availability                                         │
│      - Penalty if breached                                         │
│                                                                  │
│  Error Budget                                                      │
│  └── 100% - SLO = allowable failure                               │
│      - 99.9% SLO = 0.1% error budget                             │
│      - For 30 days: 43 min 12 sec of downtime allowed            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
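
For request-based SLIs, the same budget can be counted in failed requests rather than minutes of downtime. A minimal sketch of that view follows; the helper name `allowed_failures` and the traffic volume are illustrative assumptions, not part of the original framework.

```python
import math

def allowed_failures(slo_percent: float, total_requests: int) -> int:
    """Requests that may fail in the window without breaching the SLO.

    Hypothetical helper for illustration; slo_percent is e.g. 99.9.
    """
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    budget_fraction = (100 - slo_percent) / 100
    # Round away float noise first, then floor: a partial request cannot fail.
    return math.floor(round(budget_fraction * total_requests, 6))

# Illustrative volume: 10 million requests in the window.
print(allowed_failures(99.9, 10_000_000))  # SLO from the hierarchy above
print(allowed_failures(99.5, 10_000_000))  # SLA from the hierarchy above
```

The gap between the two numbers is the safety margin between the internal objective and the contractual commitment.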

SLI Definition

Common SLIs

# SLI definitions for different service types

slis:
  # Request-Response Services
  - name: "API Availability"
    description: "Percentage of successful API requests"
    sli_type: "availability"
    metrics:
      - source: "prometheus"
        query: |
          sum(rate(http_requests_total{service="api",status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="api"}[5m]))
    target: 99.9
    
  - name: "API Latency"
    description: "Request latency at p99"
    sli_type: "latency"
    metrics:
      - source: "prometheus"
        query: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="api"}[5m])) by (le)
          )
    target: 0.2  # 200ms
    
  # Data Processing
  - name: "Pipeline Freshness"
    description: "Age of most recent successful data"
    sli_type: "freshness"
    metrics:
      - source: "prometheus"
        query: |
          time() - max(pipeline_completion_timestamp{})
    target: 300  # 5 minutes
    
  - name: "Data Correctness"
    description: "Percentage of valid data processed"
    sli_type: "correctness"
    metrics:
      - source: "prometheus"
        query: |
          sum(rate(data_validation_passed_total[5m]))
          /
          sum(rate(data_validation_total[5m]))
    target: 99.95
    
  # Storage
  - name: "Storage Durability"
    description: "Object retention success rate"
    sli_type: "durability"
    metrics:
      - source: "prometheus"
        query: |
          1 - sum(rate(object_deletion_failed_total[5m])) / sum(rate(object_stored_total[5m]))
    target: 99.999999999  # 11 9s
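
The targets above mix units and directions: availability, correctness, and durability are "higher is better" percentages, while latency and freshness are "lower is better" values in seconds. The ratio queries also return a value between 0 and 1 rather than a percentage. A sketch of how a checker might normalize these before comparing is below; `meets_target` is a hypothetical helper, and the assumption that ratio-type queries return fractions is stated in the code.

```python
# sli_type values are taken from the definitions above.
RATIO_TYPES = {"availability", "correctness", "durability"}  # higher is better
CEILING_TYPES = {"latency", "freshness"}                     # lower is better

def meets_target(sli_type: str, measured: float, target: float) -> bool:
    """Return True when a measured SLI value satisfies its target."""
    if sli_type in RATIO_TYPES:
        # Assumption: the query returns a 0-1 ratio, the target is a percentage.
        return measured * 100 >= target
    if sli_type in CEILING_TYPES:
        # Measured value and target are both in seconds.
        return measured <= target
    raise ValueError(f"unknown sli_type: {sli_type}")

print(meets_target("availability", 0.9995, 99.9))
print(meets_target("latency", 0.25, 0.2))
```

Keeping the direction and unit next to each SLI type avoids a silent comparison of 0.999 against 99.9.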

Error Budget Calculation

#!/usr/bin/env python3
"""Error budget calculator."""

from datetime import datetime, timedelta

class ErrorBudgetCalculator:
    """Calculate error budgets and burn rates."""
    
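    def __init__(self, slo_target: float, window: str = '30d'):
        # slo_target is a fraction (e.g. 0.999), not a percentage.
        if not 0 < slo_target < 1:
            raise ValueError("slo_target must be a fraction between 0 and 1, e.g. 0.999")
        self.slo_target = slo_target
        self.window = window

        # Convert window to seconds
        window_seconds = {
            '7d': 7 * 86400,
            '30d': 30 * 86400,
            '90d': 90 * 86400,
        }
        if window not in window_seconds:
            raise ValueError(f"unsupported window: {window!r}")

        self.window_seconds = window_seconds[window]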
    
    def calculate_error_budget(self) -> dict:
        """Calculate total error budget."""
        
        error_budget_percent = (1 - self.slo_target) * 100
        error_budget_seconds = self.window_seconds * (1 - self.slo_target)
        
        return {
            'slo_target': f"{self.slo_target * 100:.10g}%",
            'error_budget': f"{error_budget_percent:.10g}%",
            'allowed_downtime_seconds': error_budget_seconds,
            'allowed_downtime_formatted': self._format_duration(error_budget_seconds)
        }
    
    def calculate_current_status(self, current_error_rate: float) -> dict:
        """Calculate error budget status.

        current_error_rate is the error ratio (0-1), assumed to hold across
        the whole window.
        """

        total_budget = self.window_seconds * (1 - self.slo_target)
        consumed_budget = self.window_seconds * current_error_rate
        remaining_budget = total_budget - consumed_budget

        # Burn rate: observed error ratio relative to the ratio the SLO allows.
        # 1.0 spends the budget in exactly one window.
        burn_rate = current_error_rate / (1 - self.slo_target)

        # Days a full budget lasts at this burn rate. At 1.0 or below, a
        # rolling-window budget is never exhausted.
        if burn_rate > 1:
            time_to_exhaustion = (self.window_seconds / 86400) / burn_rate
        else:
            time_to_exhaustion = float('inf')
        
        return {
            'total_budget_seconds': total_budget,
            'consumed_seconds': consumed_budget,
            'remaining_seconds': remaining_budget,
            'remaining_percent': remaining_budget / total_budget * 100,
            'burn_rate': burn_rate,
            'time_to_exhaustion_days': time_to_exhaustion,
            'status': self._get_status(burn_rate)
        }
    
    def _format_duration(self, seconds: float) -> str:
        """Format duration human-readable."""
        
        days = int(seconds // 86400)
        hours = int((seconds % 86400) // 3600)
        minutes = int((seconds % 3600) // 60)
        
        parts = []
        if days > 0:
            parts.append(f"{days}d")
        if hours > 0:
            parts.append(f"{hours}h")
        if minutes > 0:
            parts.append(f"{minutes}m")
        
        return ' '.join(parts) if parts else '0m'
    
    def _get_status(self, burn_rate: float) -> str:
        """Get status based on burn rate."""
        
        if burn_rate > 1.5:
            return 'critical'
        elif burn_rate > 1.0:
            return 'warning'
        elif burn_rate > 0.7:
            return 'caution'
        else:
            return 'healthy'
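
Time to exhaustion depends on both the burn rate and how much budget is already gone. The sketch below works through that arithmetic under two stated assumptions: the burn rate stays constant, and the budget is treated as a fixed pool with no rolling-window recovery. The function names are hypothetical and are not part of the calculator above.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the ratio the SLO allows (both 0-1)."""
    return error_ratio / (1 - slo_target)

def days_to_exhaustion(rate: float, consumed_fraction: float = 0.0,
                       window_days: float = 30.0) -> float:
    """Days until the budget reaches zero if `rate` is sustained.

    At burn rate 1 a full budget lasts exactly one window, so a full budget
    lasts window_days / rate; scale that by the fraction still unspent.
    """
    if rate <= 0:
        return float("inf")
    remaining = max(0.0, 1.0 - consumed_fraction)
    return remaining * window_days / rate

rate = burn_rate(error_ratio=0.002, slo_target=0.999)  # about twice the allowed ratio
print(round(rate, 2), round(days_to_exhaustion(rate), 1))
print(round(days_to_exhaustion(rate, consumed_fraction=0.5), 1))
```

With half the budget already spent, the same burn rate leaves half as many days, which is why remaining budget and burn rate are best read together.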

Burn Rate Alerts

# Prometheus alerting rules for SLOs
groups:
  - name: slo-alerts
    interval: 30s
    rules:
      # High burn rate alert
      - alert: HighBurnRate
        expr: |
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[1h])) 
            / 
            sum(rate(http_requests_total{service="api"}[1h]))
          ) 
          / (1 - 0.999) > 1.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High SLO burn rate for API service"
          description: |
            Burn rate is {{ printf "%.2f" $value }}x the sustainable rate.
            A burn rate of N exhausts a 30-day error budget in 30/N days.
            
      # Fast burn rate (short window)
      - alert: FastBurnRate
        expr: |
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[10m])) 
            / 
            sum(rate(http_requests_total{service="api"}[10m]))
          ) 
          / (1 - 0.999) > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Fast burn rate - immediate action required"
          description: |
            Extremely high burn rate detected. 
            Immediate investigation required.
            
      # Budget exhausted
      - alert: ErrorBudgetExhausted
        expr: |
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[30d])) 
            / 
            sum(rate(http_requests_total{service="api"}[30d]))
          ) 
          / (1 - 0.999) >= 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Error budget exhausted!"
          description: |
            Error budget has been exhausted.
            The SLO is in breach for the 30-day window, and further
            downtime erodes the remaining margin before the SLA.
            
      # Budget consumed warning
      - alert: ErrorBudgetWarning
        expr: |
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[30d])) 
            / 
            sum(rate(http_requests_total{service="api"}[30d]))
          ) 
          / (1 - 0.999) > 0.9
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Error budget nearly exhausted"
          description: |
            {{ $value | humanizePercentage }} of error budget consumed.
            Consider halting risky deployments.
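
To judge whether an alert threshold is sensible, it helps to convert it into the share of the 30-day budget that is burned before the alert can fire. A sketch of that conversion follows, using the thresholds and windows from the rules above; the function name `budget_consumed` is a hypothetical choice.

```python
def budget_consumed(burn_rate: float, alert_window_hours: float,
                    slo_window_days: float = 30.0) -> float:
    """Fraction of the error budget spent when `burn_rate` is sustained
    for the whole alert window."""
    return burn_rate * alert_window_hours / (slo_window_days * 24)

# (alert name, burn-rate threshold, rate() window in hours)
rules = [
    ("HighBurnRate", 1.5, 1.0),
    ("FastBurnRate", 3.0, 10 / 60),
]

for name, threshold, hours in rules:
    share = budget_consumed(threshold, hours)
    print(f"{name}: {share:.3%} of the 30-day budget")
```

With a 30-day window, a 1.5x burn sustained for one hour spends about 0.21% of the budget and a 3x burn for ten minutes about 0.07%, so both rules react long before the budget is at risk. Raising the thresholds trades that sensitivity for fewer pages.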

Error Budget Policy

# Error Budget Policy

## Principles

1. **Error budgets are for the team to spend**
   - Teams decide when to ship risky features
   - No permission needed to use error budget

2. **Transparency**
   - Error budget status is visible to everyone
   - Weekly error budget review in team standup

3. **Consequences**
   - When budget < 50%: Feature freeze
   - When budget < 25%: Incident review required
   - When budget exhausted: Emergency freeze

## Actions

| Burn Rate (% of sustainable rate) | Action |
|-----------|--------|
| < 70% | Normal operations |
| 70-90% | Increase vigilance, review pending changes |
| 90-100% | Pause non-critical deploys, incident review |
| > 100% | Feature freeze, incident response |

## Quarterly Planning

- Budget resets quarterly
- Historical burn rate informs next quarter planning
- Account for planned maintenance windows
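
The thresholds in this policy are simple enough to encode, which helps dashboards and deploy tooling stay in agreement with the written rules. The sketch below assumes hypothetical function names, reads the table's burn-rate percentages as ratios (70% = 0.7), and returns only the strictest consequence that applies.

```python
def budget_consequence(remaining_percent: float) -> str:
    """Map remaining error budget (percent) to the Consequences list.

    Returns the strictest applicable consequence only.
    """
    if remaining_percent <= 0:
        return "Emergency freeze"
    if remaining_percent < 25:
        return "Incident review required"
    if remaining_percent < 50:
        return "Feature freeze"
    return "No consequence"

def burn_rate_action(burn_rate: float) -> str:
    """Map a burn-rate ratio (1.0 == 100%) to the Actions table."""
    if burn_rate > 1.0:
        return "Feature freeze, incident response"
    if burn_rate >= 0.9:
        return "Pause non-critical deploys, incident review"
    if burn_rate >= 0.7:
        return "Increase vigilance, review pending changes"
    return "Normal operations"

print(budget_consequence(40.0))
print(burn_rate_action(0.95))
```

How the boundary values (exactly 70%, 90%, or 100%) are assigned is a choice made here for illustration; the policy text does not specify it.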
