Introduction
Site Reliability Engineering (SRE) has become the de facto standard for operating production systems in 2026. Originally pioneered by Google, SRE combines software engineering principles with operations work to create highly reliable systems. This guide covers the essential SRE concepts, practices, and tools that every engineering team should understand.
SRE is what happens when you treat operations as a software problem. SREs use software engineering to solve operational problems, automating tasks that were previously done manually.
The SRE Mindset
What’s Different About SRE?
| Traditional Ops | SRE |
|---|---|
| Manual changes | Automated everything |
| Reactive | Proactive |
| Fix on call | Prevent on call |
| Uptime as goal | Reliability as feature |
| Change avoidance | Change acceleration |
The SRE Venn Diagram
      ┌───────────────┐
      │  Engineering  │
      │               │
┌─────┴─────┐   ┌─────┴─────┐
│           │   │           │
│  System   │   │ Software  │
│   Admin   │   │  Skills   │
│           │   │           │
└─────┬─────┘   └─────┬─────┘
      │               │
      │   SRE Role    │
      │               │
      └───────────────┘
Service Level Indicators (SLIs)
What is an SLI?
An SLI is a quantitative measure of some aspect of the level of service:
# Common SLI calculations
slis = {
    "availability": "successful_requests / total_requests",
    "latency": "requests_under_threshold / total_requests",
    "quality": "good_responses / total_responses",
    "freshness": "current_data / expected_data",
}
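As a runnable illustration of these ratios (the function names are ours, not a standard library):

```python
def availability_sli(successful: int, total: int) -> float:
    """Fraction of requests that succeeded over the window."""
    return successful / total if total else 1.0

def latency_sli(under_threshold: int, total: int) -> float:
    """Fraction of requests that completed under the latency threshold."""
    return under_threshold / total if total else 1.0

# 99,950 of 100,000 requests succeeded -> 0.9995 (99.95%)
print(availability_sli(99_950, 100_000))
```

Guarding against `total == 0` matters in practice: a window with no traffic should usually count as meeting the objective, not as a division error.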
Types of SLIs
| SLI Type | Description | Examples |
|---|---|---|
| Request-driven | Based on user requests | HTTP 2xx rate |
| Infrastructure | Underlying components | Disk usage |
| Derived | Calculated from others | End-to-end latency |
Example SLI Definitions
# SLI specification
service: payment-service
slis:
  - name: availability
    description: "Percentage of successful requests"
    query: |
      sum(rate(payment_requests_total{status=~"2.."}[5m]))
      /
      sum(rate(payment_requests_total[5m]))
    objective: 0.9995
  - name: latency
    description: "95th percentile latency"
    query: |
      histogram_quantile(0.95,
        rate(payment_request_duration_seconds_bucket[5m]))
    objective: 0.5  # 500ms
  - name: correctness
    description: "Percentage of correct responses"
    query: |
      sum(rate(payment_correct_responses_total[5m]))
      /
      sum(rate(payment_responses_total[5m]))
    objective: 0.999
Service Level Objectives (SLOs)
What is an SLO?
An SLO is a target value for an SLI:
slo = {
    "slis": ["availability", "latency", "correctness"],
    "targets": {
        "availability": 0.9995,  # 99.95%
        "latency_p95": 0.5,      # 500ms
        "correctness": 0.999,    # 99.9%
    },
    "window": "30d",  # Rolling 30-day window
}
Choosing SLO Targets
| Availability | Downtime per Year | Downtime per Month | Use Case |
|---|---|---|---|
| 90% (“1 nine”) | 36.5 days | 72 hours | Internal tools |
| 99% (“2 nines”) | 3.65 days | 7.3 hours | Most services |
| 99.9% (“3 nines”) | 8.76 hours | 43.8 minutes | Customer-facing |
| 99.99% (“4 nines”) | 52.6 minutes | 4.38 minutes | Critical services |
| 99.999% (“5 nines”) | 5.26 minutes | 26.3 seconds | Ultra-critical |
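The downtime columns follow directly from the target. A quick sketch (using a 730-hour month, i.e. 8,760 h / 12, which is how the monthly figures above are derived):

```python
def allowed_downtime_hours(availability: float, window_hours: float) -> float:
    """Downtime budget implied by an availability target over a window."""
    return (1 - availability) * window_hours

# 99.9% over a 730-hour month -> about 43.8 minutes of allowed downtime
print(allowed_downtime_hours(0.999, 730) * 60)
```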
Error Budgets
An error budget is the allowed amount of unreliability:
# Error budget calculation
error_budget = {
    "target_availability": 0.999,  # 99.9%
    "window": "30d",
    "total_requests": 100_000_000,
    "allowed_errors": 100_000,     # 0.1% of 100M
    "remaining_budget": 95_000,    # after 5,000 errors
    "budget_burn_rate": "1.5x",    # how fast we're using it
}
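These figures reduce to a couple of lines of arithmetic. A sketch (the helper names are illustrative):

```python
def allowed_errors(target: float, total_requests: int) -> int:
    """Size of the error budget, in requests, over the window."""
    return round((1 - target) * total_requests)

def burn_rate(errors_used: int, budget: int, window_fraction_elapsed: float) -> float:
    """Budget consumed relative to time elapsed; 1.0 = exactly on pace
    to exhaust the budget at the end of the window."""
    return (errors_used / budget) / window_fraction_elapsed

budget = allowed_errors(0.999, 100_000_000)        # 100,000 allowed errors
# 5,000 errors one day into a 30-day window:
print(round(burn_rate(5_000, budget, 1 / 30), 2))  # burning at 1.5x pace
```

A burn rate above 1.0 sustained for long enough guarantees a breach, which is why burn rate (not raw error count) is the usual alerting signal.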
SLO Status Dashboard
┌────────────────────────────────────────────────────┐
│                Payment Service SLO                 │
├────────────────────────────────────────────────────┤
│ Availability (99.9%)                               │
│ ██████████████████████████████░░░░  87% remaining  │
│                                                    │
│ Latency P95 (500ms)                                │
│ ███████████████████████████████░░░  92% remaining  │
│                                                    │
│ Error Budget Remaining: 45%                        │
│ ⚠️  Warning: Burn rate 1.8x normal                  │
└────────────────────────────────────────────────────┘
Error Budget Policies
What Happens When SLOs Are Breached?
# Error budget policy
error_budget_policy:
  name: payment-service-availability
  conditions:
    - name: budget_exhausted
      metric: error_budget_remaining
      threshold: 0
      actions:
        - type: freeze_changes
          message: "Error budget exhausted. Feature freeze in effect."
        - type: page_oncall
          threshold: 0
        - type: escalation
          threshold: -0.1  # 10% over budget
  recovery:
    - type: allow_changes
      condition: error_budget_remaining > 0.1
      message: "Error budget recovered"
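One way to wire such a policy into tooling is a small evaluator keyed on remaining budget. A sketch (the action names mirror the YAML above; the thresholds are the same illustrative values):

```python
def policy_actions(budget_remaining: float) -> list:
    """Map remaining error budget (as a fraction of the window's budget;
    negative means overspent) to the policy's actions."""
    actions = []
    if budget_remaining <= 0:
        actions += ["freeze_changes", "page_oncall"]
    if budget_remaining <= -0.1:   # 10% over budget
        actions.append("escalation")
    if budget_remaining > 0.1:     # recovered
        actions.append("allow_changes")
    return actions

print(policy_actions(-0.15))  # ['freeze_changes', 'page_oncall', 'escalation']
```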
The “Four Golden Signals”
Google’s four golden signals for SRE:
golden_signals:
  latency:
    measure: "Time to process requests"
    sli: "p95 latency under 500ms"
  traffic:
    measure: "Requests per second"
    sli: "System handles 10k RPS"
  errors:
    measure: "Failed requests"
    sli: "Error rate under 0.1%"
  saturation:
    measure: "System capacity"
    sli: "CPU under 80%, memory under 90%"
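The four signals can be checked mechanically against thresholds. A sketch (the observed values and dict layout are illustrative; note that the traffic SLI is a floor, not a ceiling):

```python
# (observed, limit, direction) -- values are made up for illustration
SIGNALS = {
    "latency_p95_s":   (0.42,   0.5,    "under"),  # p95 under 500ms
    "traffic_rps":     (12_000, 10_000, "over"),   # handles 10k RPS
    "error_rate":      (0.0004, 0.001,  "under"),  # errors under 0.1%
    "cpu_utilization": (0.65,   0.8,    "under"),  # CPU under 80%
}

def signal_health(signals):
    """Pass/fail per signal, honoring each threshold's direction."""
    return {
        name: obs < limit if direction == "under" else obs >= limit
        for name, (obs, limit, direction) in signals.items()
    }

print(signal_health(SIGNALS))  # all True with the sample values
```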
On-Call Practices
Building an On-Call Rotation
oncall_rotation = {
    "primary": "engineer_1",
    "secondary": "engineer_2",
    "tertiary": "engineer_3",
    "rotation_period": "1 week",
    "handoff": "Monday 9am",
}
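A weekly rotation like this can be computed rather than maintained by hand. A sketch (the start date and engineer names are placeholders):

```python
from datetime import date

ENGINEERS = ["engineer_1", "engineer_2", "engineer_3"]
ROTATION_START = date(2026, 1, 5)  # a Monday, anchoring the 9am handoff

def oncall_for(day: date) -> dict:
    """Primary/secondary/tertiary assignments for a given date,
    rotating one position per week."""
    week = (day - ROTATION_START).days // 7
    return {
        "primary": ENGINEERS[week % len(ENGINEERS)],
        "secondary": ENGINEERS[(week + 1) % len(ENGINEERS)],
        "tertiary": ENGINEERS[(week + 2) % len(ENGINEERS)],
    }

print(oncall_for(date(2026, 1, 13)))  # second week: engineer_2 is primary
```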
On-Call Responsibilities
| Responsibility | Time Frame |
|---|---|
| Respond to alerts | < 15 minutes |
| Triage issues | < 30 minutes |
| Diagnose problems | < 2 hours |
| Escalate if needed | As appropriate |
Runbooks
# Example runbook
runbook:
  title: "High Error Rate on Payment Service"
  steps:
    - name: "Check Grafana dashboard"
      url: "https://grafana.example.com/d/payment-service"
    - name: "Check recent deployments"
      command: "git log --oneline -10"
    - name: "Check database connections"
      command: "kubectl exec -it payment-db -- psql -c 'SELECT count(*) FROM pg_stat_activity'"
    - name: "Check for circuit breaker events"
      command: "kubectl get cb -A"
    - name: "Roll back if needed"
      command: "argocd app rollback payment-service 1"
  common_issues:
    - "Database connection pool exhausted - scale up"
    - "Third-party payment provider down - check status page"
    - "Bad deploy - roll back immediately"
Post-Incident Reviews
The Blameless Post-Mortem
# Incident Post-Mortem
## Summary
- **Incident**: Payment service outage
- **Duration**: 45 minutes
- **Impact**: 2,000 failed transactions
## Timeline
- 10:00 - Alert triggered
- 10:05 - On-call acknowledged
- 10:15 - Root cause identified
- 10:35 - Fix deployed
- 10:45 - Service recovered
## Root Cause
Database connection pool exhausted due to connection leak in payment processor library.
## What Went Well
- Alert fired quickly
- Team responded promptly
- Rollback worked correctly
## What Could Be Improved
- Add connection pool monitoring
- Add circuit breaker to payment calls
- Improve test coverage for connection handling
## Action Items
- [ ] Add connection_pool_available SLI
- [ ] Add circuit breaker
- [ ] Add integration test for connection handling
SRE Tools in 2026
Monitoring & Observability
- Prometheus: Metrics collection and alerting
- Grafana: Visualization and dashboards
- Jaeger: Distributed tracing
- Loki: Log aggregation
Incident Management
- PagerDuty: On-call and alerting
- Opsgenie: Alert management
- Splunk On-Call (formerly VictorOps): Incident response
Service Level Management
- Sloth: SLO generator
- Nobl9: SLO platform
- Datadog SLO: Built-in SLO tracking
Implementing SRE
Step 1: Define Your SLIs
Start with the four golden signals:
- Latency
- Traffic
- Errors
- Saturation
Step 2: Set SLO Targets
Start conservative, adjust based on reality:
- 99% for internal services
- 99.9% for customer-facing
- 99.99% for critical
Step 3: Measure and Monitor
Build dashboards showing:
- Current SLO status
- Error budget burn rate
- Historical trends
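The numbers on such a dashboard reduce to a couple of ratios. A sketch of the underlying computation (the function and field names are ours):

```python
def slo_status(target: float, good: int, total: int) -> dict:
    """Current SLI, compliance, and remaining error budget fraction."""
    sli = good / total if total else 1.0
    budget = 1 - target  # allowed error fraction
    spent = 1 - sli      # observed error fraction
    return {
        "sli": sli,
        "compliant": sli >= target,
        "budget_remaining": 1 - spent / budget,
    }

# 999,400 good out of 1,000,000 against a 99.9% target:
status = slo_status(0.999, 999_400, 1_000_000)
print(status["compliant"], round(status["budget_remaining"], 2))  # True 0.4
```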
Step 4: Create Error Budget Policies
Define what happens when SLOs are at risk:
- Warning thresholds
- Action items
- Communication plans
Step 5: Automate Everything
SRE is about removing manual operational work:
- Automated deployments
- Automated scaling
- Automated remediation
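As one example of automated remediation, a toy scaling rule driven by the saturation thresholds mentioned earlier (the thresholds and hysteresis band are illustrative, not a production policy):

```python
def desired_replicas(cpu: float, mem: float, replicas: int) -> int:
    """Naive reconciliation: scale out when saturated, in when idle."""
    if cpu > 0.8 or mem > 0.9:   # saturation thresholds
        return replicas + 1
    if cpu < 0.3 and mem < 0.5 and replicas > 1:
        return replicas - 1      # scale in, but never below one replica
    return replicas

print(desired_replicas(0.85, 0.60, 3))  # saturated -> 4
```

Real systems add cooldowns and rate limits around rules like this to avoid flapping; the point is that the decision is encoded, not made by a human at 3am.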
Conclusion
SRE provides a framework for building and operating reliable systems. By focusing on SLIs, SLOs, and error budgets, teams can make informed decisions about reliability investments. The key is to start simple, measure what matters, and continuously improve.
In 2026, SRE principles are essential for any team responsible for production systems. The shift from “keeping the lights on” to “engineering reliability” transforms how we think about operations.