Introduction
Service Level Objectives (SLOs) provide a framework for making reliability decisions. Understanding error budgets and burn rates helps teams balance feature velocity against reliability.
Key Statistics:
- Teams with SLOs ship 40% faster with fewer incidents
- Error budgets align incentives between product and engineering
- 73% of SREs say burn rate alerts prevent outages
- Proper SLOs improve customer trust by 60%
SLO Framework
SLO Hierarchy

SLI (Service Level Indicator)
└── Metric that measures a specific aspect of service behavior
    - Request latency
    - Error rate
    - Availability
    - Throughput

SLO (Service Level Objective)
└── Target value or range for the SLI
    - 99.9% availability
    - p99 latency < 200ms

SLA (Service Level Agreement)
└── Contractual commitment with customers
    - 99.5% availability
    - Penalty if breached

Error Budget
└── 100% - SLO = allowable failure
    - 99.9% SLO = 0.1% error budget
    - For 30 days: 43 min 12 sec of downtime allowed
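The error budget arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name `allowed_downtime` is hypothetical, and it assumes the whole budget is spent as full downtime rather than as partial errors.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window_days: int = 30) -> timedelta:
    """Return the downtime a given SLO permits over the window.

    Assumes the whole error budget is spent as total outage time.
    """
    budget_fraction = (100 - slo_percent) / 100  # e.g. 99.9 -> 0.001
    seconds = window_days * 86400 * budget_fraction
    # Round to whole seconds to avoid floating-point noise in the output
    return timedelta(seconds=round(seconds))


if __name__ == "__main__":
    for slo in (99.0, 99.5, 99.9, 99.99):
        print(f"{slo}% over 30 days -> {allowed_downtime(slo)}")
```

For a 99.9% SLO over 30 days this gives 2,592 seconds, which is 43 minutes 12 seconds.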
SLI Definition
Common SLIs
# SLI definitions for different service types
# Ratio-based SLIs are scaled by 100 so they compare directly with percentage targets
slis:
  # Request-Response Services
  - name: "API Availability"
    description: "Percentage of successful API requests"
    sli_type: "availability"
    metrics:
      - source: "prometheus"
        query: |
          100 * (
            sum(rate(http_requests_total{service="api",status=~"2.."}[5m]))
            /
            sum(rate(http_requests_total{service="api"}[5m]))
          )
    target: 99.9  # percent
  - name: "API Latency"
    description: "Request latency at p99"
    sli_type: "latency"
    metrics:
      - source: "prometheus"
        query: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="api"}[5m])) by (le)
          )
    target: 0.2  # seconds (200ms)
  # Data Processing
  - name: "Pipeline Freshness"
    description: "Age of most recent successful data"
    sli_type: "freshness"
    metrics:
      - source: "prometheus"
        query: |
          time() - max(pipeline_completion_timestamp)
    target: 300  # seconds (5 minutes)
  - name: "Data Correctness"
    description: "Percentage of valid data processed"
    sli_type: "correctness"
    metrics:
      - source: "prometheus"
        query: |
          100 * (
            sum(rate(data_validation_passed_total[5m]))
            /
            sum(rate(data_validation_total[5m]))
          )
    target: 99.95  # percent
  # Storage
  - name: "Storage Durability"
    description: "Object retention success rate"
    sli_type: "durability"
    metrics:
      - source: "prometheus"
        query: |
          100 * (
            1 - sum(rate(object_deletion_failed_total[5m])) / sum(rate(object_stored_total[5m]))
          )
    target: 99.999999999  # percent (eleven 9s)
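The definitions above mix two kinds of target: availability, correctness, and durability are percentages where higher is better, while latency and freshness are durations where lower is better. The sketch below shows one way to compare a measured value against its target. The names `LOWER_IS_BETTER`, `meets_target`, and `availability_percent` are hypothetical, and treating a zero-traffic period as compliant is an assumption, not something the definitions specify.

```python
# SLI types where a smaller measured value is better (assumption based on
# the definitions above: latency in seconds, freshness as age in seconds)
LOWER_IS_BETTER = {"latency", "freshness"}


def meets_target(sli_type: str, measured: float, target: float) -> bool:
    """Return True when the measured SLI value satisfies its target."""
    if sli_type in LOWER_IS_BETTER:
        return measured <= target
    # availability, correctness, durability: percentages, higher is better
    return measured >= target


def availability_percent(good_requests: int, total_requests: int) -> float:
    """Good-events / total-events ratio, expressed as a percentage."""
    if total_requests == 0:
        return 100.0  # no traffic: treated here as meeting the objective
    return 100.0 * good_requests / total_requests


if __name__ == "__main__":
    measured = availability_percent(good_requests=99_950, total_requests=100_000)
    print(measured, meets_target("availability", measured, 99.9))
    print(meets_target("latency", 0.25, 0.2))
```

Keeping the comparison direction next to the SLI type avoids the common mistake of checking a latency target with `>=`.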
Error Budget Calculation
#!/usr/bin/env python3
"""Error budget calculator."""


class ErrorBudgetCalculator:
    """Calculate error budgets and burn rates.

    slo_target is a fraction (0.999 for a 99.9% SLO), not a percentage.
    """

    # Supported rolling windows, in seconds
    WINDOW_SECONDS = {
        '7d': 7 * 86400,
        '30d': 30 * 86400,
        '90d': 90 * 86400,
    }

    def __init__(self, slo_target: float, window: str = '30d'):
        if not 0 < slo_target < 1:
            raise ValueError("slo_target must be a fraction between 0 and 1")
        if window not in self.WINDOW_SECONDS:
            raise ValueError(f"unsupported window: {window!r}")
        self.slo_target = slo_target
        self.window = window
        self.window_seconds = self.WINDOW_SECONDS[window]

    def calculate_error_budget(self) -> dict:
        """Calculate total error budget."""
        error_budget_percent = (1 - self.slo_target) * 100
        error_budget_seconds = self.window_seconds * (1 - self.slo_target)
        return {
            # .10g trims floating-point noise such as 0.10000000000000009
            'slo_target': f"{self.slo_target * 100:.10g}%",
            'error_budget': f"{error_budget_percent:.10g}%",
            'allowed_downtime_seconds': error_budget_seconds,
            'allowed_downtime_formatted': self._format_duration(error_budget_seconds),
        }

    def calculate_current_status(self, current_error_rate: float) -> dict:
        """Calculate current error budget status.

        current_error_rate is the fraction of failed requests (0.002 for 0.2%)
        and is assumed to hold over the whole window.
        """
        total_budget = self.window_seconds * (1 - self.slo_target)
        consumed_budget = self.window_seconds * current_error_rate
        remaining_budget = total_budget - consumed_budget  # negative once overspent

        # Burn rate: consumption relative to the rate that would spend
        # exactly the whole budget over the window (burn rate 1.0)
        burn_rate = current_error_rate / (1 - self.slo_target)

        # At a sustained burn rate B > 1, a full budget lasts window / B
        window_days = self.window_seconds / 86400
        if burn_rate > 1:
            time_to_exhaustion = window_days / burn_rate
        else:
            time_to_exhaustion = float('inf')

        return {
            'total_budget_seconds': total_budget,
            'consumed_seconds': consumed_budget,
            'remaining_seconds': remaining_budget,
            'remaining_percent': remaining_budget / total_budget * 100,
            'burn_rate': burn_rate,
            'time_to_exhaustion_days': time_to_exhaustion,
            'status': self._get_status(burn_rate),
        }

    def _format_duration(self, seconds: float) -> str:
        """Format a duration as days, hours, minutes, and seconds."""
        total = int(round(seconds))
        days, rem = divmod(total, 86400)
        hours, rem = divmod(rem, 3600)
        minutes, secs = divmod(rem, 60)
        parts = []
        if days > 0:
            parts.append(f"{days}d")
        if hours > 0:
            parts.append(f"{hours}h")
        if minutes > 0:
            parts.append(f"{minutes}m")
        if secs > 0:
            parts.append(f"{secs}s")
        return ' '.join(parts) if parts else '0s'

    def _get_status(self, burn_rate: float) -> str:
        """Get status based on burn rate."""
        if burn_rate > 1.5:
            return 'critical'
        elif burn_rate > 1.0:
            return 'warning'
        elif burn_rate > 0.7:
            return 'caution'
        else:
            return 'healthy'
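As a worked example of the burn-rate idea, the standalone sketch below does not depend on the class above. The function names are hypothetical, and it assumes a constant error rate and a full budget at the start of the window.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return error_rate / (1 - slo_target)


def days_until_exhausted(rate: float, window_days: float = 30) -> float:
    """Days a full budget lasts at a sustained burn rate."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


if __name__ == "__main__":
    # 0.2% of requests failing against a 99.9% SLO
    rate = burn_rate(error_rate=0.002, slo_target=0.999)
    print(f"burn rate: {rate:.2f}")
    print(f"budget lasts: {days_until_exhausted(rate):.1f} days")
```

A 0.2% error rate against a 0.1% budget is a burn rate of about 2, so a 30-day budget lasts about 15 days. A result longer than the window means the budget is not exhausted at that rate.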
Burn Rate Alerts
# Prometheus alerting rules for SLOs
groups:
  - name: slo-alerts
    interval: 30s
    rules:
      # High burn rate alert
      - alert: HighBurnRate
        expr: |
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total{service="api"}[1h]))
          )
          / (1 - 0.999) > 1.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High SLO burn rate for API service"
          description: |
            Burn rate is {{ $value | humanize }}x the sustainable rate.
            A full 30-day error budget lasts 30 days divided by the burn rate.
      # Fast burn rate (short window)
      - alert: FastBurnRate
        expr: |
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[10m]))
            /
            sum(rate(http_requests_total{service="api"}[10m]))
          )
          / (1 - 0.999) > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Fast burn rate - immediate action required"
          description: |
            Extremely high burn rate detected.
            Immediate investigation required.
      # Budget exhausted
      - alert: ErrorBudgetExhausted
        expr: |
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[30d]))
            /
            sum(rate(http_requests_total{service="api"}[30d]))
          )
          / (1 - 0.999) >= 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Error budget exhausted!"
          description: |
            Error budget has been exhausted.
            Further downtime violates the SLO and puts the SLA at risk.
      # Budget consumed warning
      - alert: ErrorBudgetWarning
        expr: |
          (
            sum(rate(http_requests_total{service="api",status=~"5.."}[30d]))
            /
            sum(rate(http_requests_total{service="api"}[30d]))
          )
          / (1 - 0.999) > 0.9
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Error budget nearly exhausted"
          description: |
            {{ $value | humanizePercentage }} of error budget consumed.
            Consider halting risky deployments.
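To judge whether a burn-rate threshold is reasonable, it helps to translate it into the share of the budget that is already spent by the time the alert fires. The sketch below assumes the burn rate is constant across the whole alert window and that the SLO window is 30 days, as in the rules above. The function name `budget_consumed_fraction` is hypothetical.

```python
def budget_consumed_fraction(burn_rate: float, alert_window_hours: float,
                             slo_window_days: float = 30) -> float:
    """Fraction of the total error budget spent during the alert window."""
    slo_window_hours = slo_window_days * 24
    return burn_rate * alert_window_hours / slo_window_hours


if __name__ == "__main__":
    # Thresholds and windows taken from the HighBurnRate and FastBurnRate rules
    rules = (("HighBurnRate", 1.5, 1.0), ("FastBurnRate", 3.0, 10 / 60))
    for name, rate, hours in rules:
        share = budget_consumed_fraction(rate, hours)
        print(f"{name}: {share:.4%} of the 30-day budget")
```

Both thresholds fire after only a small fraction of a percent of the budget is gone (roughly 0.2% and 0.07%), so they may be noisy on low-traffic services. Raising the threshold or lengthening the window trades detection speed for fewer pages.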
Error Budget Policy
# Error Budget Policy
## Principles
1. **Error budgets are for the team to spend**
   - Teams decide when to ship risky features
   - No permission needed to use error budget
2. **Transparency**
   - Error budget status is visible to everyone
   - Weekly error budget review in team standup
3. **Consequences**
   - When remaining budget < 50%: Feature freeze
   - When remaining budget < 25%: Incident review required
   - When budget exhausted: Emergency freeze
## Actions
| Burn Rate (% of sustainable rate) | Action |
|-----------------------------------|--------|
| < 70% | Normal operations |
| 70-90% | Increase vigilance, review pending changes |
| 90-100% | Pause non-critical deploys, incident review |
| > 100% | Feature freeze, incident response |
## Quarterly Planning
- Budget resets quarterly
- Historical burn rate informs next quarter planning
- Account for planned maintenance windows
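The Actions table in the policy can be encoded as a small lookup so that dashboards and deploy tooling apply it consistently. This is a sketch under two assumptions: the percentages in the table are read as burn rate relative to the sustainable rate (1.0 = 100%), and the boundary values 70% and 90% belong to the stricter row, which the table leaves ambiguous. The function name `policy_action` is hypothetical.

```python
def policy_action(burn_rate: float) -> str:
    """Map a burn rate (1.0 == 100% of the sustainable rate) to a policy action."""
    if burn_rate > 1.0:
        return "Feature freeze, incident response"
    if burn_rate >= 0.9:
        # Assumption: exactly 90% falls into the stricter 90-100% row
        return "Pause non-critical deploys, incident review"
    if burn_rate >= 0.7:
        # Assumption: exactly 70% falls into the 70-90% row
        return "Increase vigilance, review pending changes"
    return "Normal operations"


if __name__ == "__main__":
    for rate in (0.5, 0.8, 0.95, 1.2):
        print(f"{rate:.0%}: {policy_action(rate)}")
```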