SLOs & Error Budgets: Reliability Metrics That Matter

Introduction

Service Level Objectives (SLOs) define the reliability targets for your services. Error budgets quantify how much unreliability you can tolerate. Together, they enable data-driven decisions about when to prioritize reliability improvements versus feature development. Many organizations lack clear SLOs, resulting in either over-engineering or under-reliability.

This comprehensive guide covers SLO design and error budget management.

Core Concepts

SLO (Service Level Objective)

Target level of service reliability (e.g., 99.9% uptime).

SLA (Service Level Agreement)

Contractual commitment to SLO with penalties for missing it.

Error Budget

Amount of unreliability allowed within SLO (100% - SLO%).

Availability

Percentage of time service is operational.

Latency

Time to respond to requests.

Throughput

Requests handled per second.

Reliability

Probability service works correctly.

Burn Rate

Speed at which error budget is consumed.

SLO Design

Choosing SLO Targets

class SLODesigner:
    def __init__(self):
        self.services = {}
    
    def design_slo(self, service_name, criticality, user_impact):
        """Design appropriate SLO"""
        
        if criticality == 'critical' and user_impact == 'high':
            # Critical service: 99.99% (52 minutes/year downtime)
            slo = 0.9999
        elif criticality == 'high' and user_impact == 'high':
            # High priority: 99.9% (8.7 hours/year downtime)
            slo = 0.999
        elif criticality == 'medium':
            # Medium priority: 99% (3.7 days/year downtime)
            slo = 0.99
        else:
            # Low priority: 95% (18 days/year downtime)
            slo = 0.95
        
        self.services[service_name] = {
            'slo': slo,
            'error_budget': 1 - slo,
            'downtime_per_year': (1 - slo) * 365 * 24 * 60  # minutes
        }
        
        return self.services[service_name]

# Design SLOs
designer = SLODesigner()

designer.design_slo('payment-service', 'critical', 'high')
designer.design_slo('api-server', 'high', 'high')
designer.design_slo('analytics', 'medium', 'low')
designer.design_slo('admin-panel', 'low', 'low')

for service, slo_info in designer.services.items():
    print(f"{service}: {slo_info['slo']*100}% SLO")
    print(f"  Error budget: {slo_info['error_budget']*100}%")
    print(f"  Downtime allowed: {slo_info['downtime_per_year']:.0f} minutes/year")

SLO Metrics

class SLOMetrics:
    def __init__(self, slo_target):
        self.slo_target = slo_target
        self.error_budget = 1 - slo_target
    
    def calculate_availability(self, uptime_seconds, total_seconds):
        """Calculate actual availability"""
        return uptime_seconds / total_seconds
    
    def calculate_error_budget_remaining(self, actual_availability):
        """Calculate remaining error budget"""
        error_budget_used = 1 - actual_availability
        error_budget_remaining = self.error_budget - error_budget_used
        
        return {
            'used': error_budget_used,
            'remaining': error_budget_remaining,
            'percentage_remaining': (error_budget_remaining / self.error_budget) * 100
        }
    
    def calculate_burn_rate(self, error_budget_used, time_period_hours):
        """Calculate error budget burn rate"""
        # How many times faster than expected are we burning budget?
        expected_burn = self.error_budget / (365 * 24)  # per hour
        actual_burn = error_budget_used / time_period_hours
        
        return actual_burn / expected_burn

# Monitor SLO
metrics = SLOMetrics(slo_target=0.999)

# After 1 week
uptime = 604800 - 3600  # 1 hour downtime
total = 604800
availability = metrics.calculate_availability(uptime, total)

budget = metrics.calculate_error_budget_remaining(availability)
print(f"Availability: {availability*100:.3f}%")
print(f"Error budget remaining: {budget['percentage_remaining']:.1f}%")

# Calculate burn rate
burn_rate = metrics.calculate_burn_rate(1 - availability, 168)  # 168 hours in week
print(f"Burn rate: {burn_rate:.1f}x")

Error Budget Management

Error Budget Allocation

class ErrorBudgetManager:
    def __init__(self, monthly_error_budget):
        self.monthly_budget = monthly_error_budget
        self.allocated = {}
        self.used = {}
    
    def allocate_budget(self, category, percentage):
        """Allocate error budget to categories"""
        self.allocated[category] = self.monthly_budget * percentage
    
    def track_usage(self, category, downtime_minutes):
        """Track error budget usage"""
        if category not in self.used:
            self.used[category] = 0
        
        self.used[category] += downtime_minutes
    
    def get_remaining_budget(self, category):
        """Get remaining budget for category"""
        allocated = self.allocated.get(category, 0)
        used = self.used.get(category, 0)
        
        return allocated - used
    
    def should_deploy(self, category):
        """Decide if safe to deploy"""
        remaining = self.get_remaining_budget(category)
        
        # Only deploy if > 25% budget remaining
        return remaining > (self.allocated[category] * 0.25)

# Setup error budget management
# 99.9% SLO = 43.2 minutes/month error budget
manager = ErrorBudgetManager(monthly_error_budget=43.2)

# Allocate budget
manager.allocate_budget('deployments', 0.5)  # 50% for deployments
manager.allocate_budget('infrastructure', 0.3)  # 30% for infrastructure
manager.allocate_budget('external', 0.2)  # 20% for external issues

# Track usage
manager.track_usage('deployments', 10)  # 10 minutes downtime from deployment
manager.track_usage('infrastructure', 5)  # 5 minutes from infrastructure

# Check if safe to deploy
print(f"Safe to deploy: {manager.should_deploy('deployments')}")
print(f"Remaining budget: {manager.get_remaining_budget('deployments'):.1f} minutes")

Burn Rate Alerts

class BurnRateAlerter:
    def __init__(self, slo_target):
        self.slo_target = slo_target
        self.error_budget = 1 - slo_target
    
    def check_burn_rate(self, current_availability, time_window_hours):
        """Check if burn rate is too high"""
        
        error_budget_used = 1 - current_availability
        burn_rate = error_budget_used / time_window_hours
        expected_burn = self.error_budget / (365 * 24)
        
        burn_rate_multiplier = burn_rate / expected_burn
        
        # Alert levels
        if burn_rate_multiplier > 10:
            return 'critical', f"Burn rate: {burn_rate_multiplier:.1f}x"
        elif burn_rate_multiplier > 5:
            return 'warning', f"Burn rate: {burn_rate_multiplier:.1f}x"
        else:
            return 'ok', f"Burn rate: {burn_rate_multiplier:.1f}x"

# Monitor burn rate
alerter = BurnRateAlerter(slo_target=0.999)

# Check after 1 hour with 99% availability
status, message = alerter.check_burn_rate(0.99, 1)
print(f"Status: {status}, {message}")

SLO Implementation

Prometheus Queries

# Calculate availability
(1 - (rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]))) * 100

# Calculate error budget remaining
(1 - (1 - (rate(http_requests_total{status=~"5.."}[30d]) / rate(http_requests_total[30d])))) * 100

# Calculate burn rate
(rate(http_requests_total{status=~"5.."}[1h]) / rate(http_requests_total[1h])) / (1 - 0.999)

Grafana Dashboard

{
  "dashboard": {
    "title": "SLO Dashboard",
    "panels": [
      {
        "title": "Availability",
        "targets": [
          {
            "expr": "(1 - (rate(http_requests_total{status=~\"5..\"}[5m]) / rate(http_requests_total[5m]))) * 100"
          }
        ],
        "thresholds": [99.9]
      },
      {
        "title": "Error Budget Remaining",
        "targets": [
          {
            "expr": "(1 - (1 - (rate(http_requests_total{status=~\"5..\"}[30d]) / rate(http_requests_total[30d])))) * 100"
          }
        ]
      },
      {
        "title": "Burn Rate",
        "targets": [
          {
            "expr": "(rate(http_requests_total{status=~\"5..\"}[1h]) / rate(http_requests_total[1h])) / (1 - 0.999)"
          }
        ]
      }
    ]
  }
}

Best Practices

Align SLOs with Business: Match SLO to business criticality
Conservative SLOs: Set achievable targets
Monitor Burn Rate: Alert on high burn rates
Use Error Budget: Guide deployment decisions
Communicate SLOs: Share with stakeholders
Review Regularly: Adjust SLOs based on experience
Measure Accurately: Ensure accurate SLO measurement
Blameless: Use SLOs for improvement, not blame
Gradual Improvement: Improve SLOs incrementally
Document Rationale: Explain SLO choices

External Resources

SLO Resources

Tools

Conclusion

SLOs and error budgets enable data-driven reliability decisions. By setting clear targets and tracking error budget consumption, organizations balance reliability with feature development.

Start with conservative SLOs, monitor burn rate, and use error budget to guide deployment decisions.

SLOs are the foundation of reliability engineering.