⚡ Calmops

Site Reliability Engineering: SRE Principles and Practices in 2026

Introduction

Site Reliability Engineering (SRE) has become the de facto standard for operating production systems in 2026. Originally pioneered by Google, SRE combines software engineering principles with operations work to create highly reliable systems. This guide covers the essential SRE concepts, practices, and tools that every engineering team should understand.

SRE is what happens when you treat operations as a software problem. SREs use software engineering to solve operational problems, automating tasks that were previously done manually.

The SRE Mindset

What’s Different About SRE?

| Traditional Ops | SRE |
| --- | --- |
| Manual changes | Automated everything |
| Reactive | Proactive |
| Fix on call | Prevent on call |
| Uptime as goal | Reliability as feature |
| Change avoidance | Change acceleration |

The SRE Venn Diagram

              ┌─────────────────┐
              │   Engineering   │
              │                 │
        ┌─────┴─────┐     ┌─────┴─────┐
        │           │     │           │
        │  System   │     │  Software │
        │  Admin    │     │  Skills   │
        │           │     │           │
        └─────┬─────┘     └─────┬─────┘
              │                 │
              │    SRE Role     │
              │                 │
              └─────────────────┘

Service Level Indicators (SLIs)

What is an SLI?

An SLI is a quantitative measure of some aspect of the level of service:

# Common SLI calculations
slis = {
    "availability": "successful_requests / total_requests",
    "latency": "requests under threshold / total_requests", 
    "quality": "good_responses / total_responses",
    "freshness": "current_data / expected_data"
}
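The ratios above can be computed directly from raw counters. A minimal sketch, with illustrative function and metric names (not from any specific monitoring system):

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic: treat the service as fully available
    return successful_requests / total_requests

def latency_sli(durations_ms: list[float], threshold_ms: float) -> float:
    """Fraction of requests that completed within the threshold."""
    if not durations_ms:
        return 1.0
    fast = sum(1 for d in durations_ms if d <= threshold_ms)
    return fast / len(durations_ms)

availability = availability_sli(99_950, 100_000)        # 0.9995
latency = latency_sli([120, 340, 480, 900, 150], 500)   # 4 of 5 under 500ms -> 0.8
```

In practice these ratios come from a metrics backend (as in the Prometheus queries below) rather than raw lists, but the arithmetic is the same.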

Types of SLIs

| SLI Type | Description | Examples |
| --- | --- | --- |
| Request-driven | Based on user requests | HTTP 2xx rate |
| Infrastructure | Underlying components | Disk usage |
| Derived | Calculated from others | End-to-end latency |

Example SLI Definitions

# SLI specification
service: payment-service
slis:
  - name: availability
    description: "Percentage of successful requests"
    query: |
      sum(rate(payment_requests_total{status=~"2.."}[5m]))
      /
      sum(rate(payment_requests_total[5m]))
    objective: 0.9995
    
  - name: latency
    description: "95th percentile latency"
    query: |
      histogram_quantile(0.95, 
        rate(payment_request_duration_seconds_bucket[5m]))
    objective: 0.5  # 500ms
    
  - name: correctness
    description: "Percentage of correct responses"
    query: |
      sum(rate(payment_correct_responses_total[5m]))
      /
      sum(rate(payment_responses_total[5m]))
    objective: 0.999

Service Level Objectives (SLOs)

What is an SLO?

An SLO is a target value for an SLI:

slo = {
    "slis": ["availability", "latency", "correctness"],
    "targets": {
        "availability": 0.9995,  # 99.95%
        "latency_p95": 0.5,      # 500ms
        "correctness": 0.999     # 99.9%
    },
    "window": "30d"  # Rolling 30-day window
}
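Checking measured SLI values against those targets is a simple comparison; note that latency is an upper bound while the ratio-based SLIs are lower bounds. A sketch using the targets above (measured values are made up for illustration):

```python
def slo_met(measured: dict, targets: dict) -> dict:
    """Compare measured SLI values against SLO targets."""
    return {
        "availability": measured["availability"] >= targets["availability"],
        "latency_p95": measured["latency_p95"] <= targets["latency_p95"],  # upper bound
        "correctness": measured["correctness"] >= targets["correctness"],
    }

status = slo_met(
    measured={"availability": 0.9997, "latency_p95": 0.42, "correctness": 0.9985},
    targets={"availability": 0.9995, "latency_p95": 0.5, "correctness": 0.999},
)
# availability and latency pass; correctness (0.9985 < 0.999) fails
```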

Choosing SLO Targets

| Availability | Downtime per Year | Downtime per Month | Use Case |
| --- | --- | --- | --- |
| 90% (“one nine”) | 36.5 days | 72 hours | Internal tools |
| 99% (“two nines”) | 3.65 days | 7.3 hours | Most services |
| 99.9% (“three nines”) | 8.76 hours | 43.8 minutes | Customer-facing |
| 99.99% (“four nines”) | 52.6 minutes | 4.38 minutes | Critical services |
| 99.999% (“five nines”) | 5.26 minutes | 26.3 seconds | Ultra-critical |
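The downtime figures in the table fall out of one formula: allowed downtime equals the unavailability fraction times the window length. A quick way to reproduce any row:

```python
def downtime_budget_minutes(availability: float, window_hours: float) -> float:
    """Allowed downtime (in minutes) for an availability target over a window."""
    return (1 - availability) * window_hours * 60

# Reproduce two cells of the table above:
per_year = downtime_budget_minutes(0.999, 365 * 24)  # ~525.6 min = 8.76 hours/year
per_month = downtime_budget_minutes(0.9999, 730)     # ~4.38 min/month
```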

Error Budgets

An error budget is the allowed amount of unreliability:

# Error budget calculation
error_budget = {
    "target_availability": 0.999,  # 99.9%
    "window": "30d",
    "total_requests": 100_000_000,
    
    "allowed_errors": 100_000,  # 0.1% of 100M
    "remaining_budget": 95_000,  # After 5,000 errors
    "budget_burn_rate": "1.5x"  # How fast we're using it
}
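The burn rate compares the errors actually spent to the steady pace that would exhaust the budget exactly at the end of the window. A sketch of that calculation (parameter names are illustrative):

```python
def burn_rate(errors_so_far: int, total_requests: int,
              target_availability: float, window_days: int,
              elapsed_days: float) -> float:
    """Budget consumption rate relative to a steady, window-long pace.

    1.0 means on pace to exhaust the budget exactly at the window's end;
    above 1.0 means the budget will run out early.
    """
    allowed_errors = (1 - target_availability) * total_requests
    expected_by_now = allowed_errors * (elapsed_days / window_days)
    return errors_so_far / expected_by_now

# 50k errors 10 days into a 30-day window with a 100k budget:
rate = burn_rate(50_000, 100_000_000, 0.999, 30, 10)  # ~1.5x
```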

SLO Status Dashboard

┌─────────────────────────────────────────────────────────────┐
│                    Payment Service SLO                      │
├─────────────────────────────────────────────────────────────┤
│  Availability (99.9%)                                       │
│  █████████████████████████████░░░░░░░░  87% remaining       │
│                                                             │
│  Latency P95 (500ms)                                        │
│  ███████████████████████████████░░░░░░░  92% remaining      │
│                                                             │
│  Error Budget Remaining: 45%                                │
│  ⚠️  Warning: Burn rate 1.8x normal                          │
└─────────────────────────────────────────────────────────────┘

Error Budget Policies

What Happens When SLOs Are Breached?

# Error budget policy
error_budget_policy:
  name: payment-service-availability
  
  conditions:
    - name: budget_exhausted
      metric: error_budget_remaining
      threshold: 0
      
  actions:
    - type: freeze_changes
      message: "Error budget exhausted. Feature freeze in effect."
      
    - type: page_oncall
      threshold: 0
      
    - type: escalation
      threshold: -0.1  # 10% over budget
      
  recovery:
    - type: allow_changes
      condition: error_budget_remaining > 0.1
      message: "Error budget recovered"
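A policy like the one above is easy to evaluate in code. A hedged sketch of the decision logic, with thresholds mirroring the YAML (the action names are illustrative, not a real tool's API):

```python
def policy_actions(error_budget_remaining: float) -> list[str]:
    """Map remaining error budget (as a fraction of the window's budget,
    negative when overspent) to the actions defined in the policy."""
    actions = []
    if error_budget_remaining <= 0:
        actions += ["freeze_changes", "page_oncall"]
    if error_budget_remaining <= -0.1:  # 10% over budget
        actions.append("escalation")
    if error_budget_remaining > 0.1:    # recovery condition
        actions.append("allow_changes")
    return actions

policy_actions(0.0)   # ["freeze_changes", "page_oncall"]
policy_actions(0.45)  # ["allow_changes"]
```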

The “Four Golden Signals”

Google’s four golden signals for SRE:

golden_signals:
  latency:
    measure: "Time to process requests"
    sli: "p95 latency under 500ms"
    
  traffic:
    measure: "Requests per second"
    sli: "System handles 10k RPS"
    
  errors:
    measure: "Failed requests"
    sli: "Error rate under 0.1%"
    
  saturation:
    measure: "System capacity"
    sli: "CPU under 80%, memory under 90%"
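A periodic health check can evaluate current measurements against all four signals at once. A minimal sketch with the thresholds above (measurement names are illustrative):

```python
def check_golden_signals(m: dict) -> dict:
    """Return pass/fail for each golden signal given current measurements."""
    return {
        "latency": m["p95_latency_s"] < 0.5,            # p95 under 500ms
        "traffic": m["rps"] <= 10_000,                  # within rated capacity
        "errors": m["error_rate"] < 0.001,              # error rate under 0.1%
        "saturation": m["cpu"] < 0.80 and m["memory"] < 0.90,
    }

signals = check_golden_signals({
    "p95_latency_s": 0.42, "rps": 8_500,
    "error_rate": 0.0004, "cpu": 0.65, "memory": 0.72,
})
# all four signals pass
```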

On-Call Practices

Building an On-Call Rotation

oncall_rotation = {
    "primary": "engineer_1",
    "secondary": "engineer_2", 
    "tertiary": "engineer_3",
    "rotation_period": "1 week",
    "handoff": "Monday 9am"
}
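Given a rotation like this, working out who holds the pager on a given date is a modular-arithmetic lookup. A sketch, assuming a weekly rotation with a fixed start date (the dates here are illustrative):

```python
from datetime import date

def current_oncall(rotation: list[str], start: date, today: date,
                   period_days: int = 7) -> str:
    """Primary on-call for `today`, given a rotation that advances
    every `period_days` days from the `start` handoff date."""
    periods_elapsed = (today - start).days // period_days
    return rotation[periods_elapsed % len(rotation)]

rotation = ["engineer_1", "engineer_2", "engineer_3"]
# Rotation started Monday 2026-01-05; three weeks later it wraps around:
primary = current_oncall(rotation, date(2026, 1, 5), date(2026, 1, 26))
# primary == "engineer_1"
```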

On-Call Responsibilities

| Responsibility | Time Frame |
| --- | --- |
| Respond to alerts | < 15 minutes |
| Triage issues | < 30 minutes |
| Diagnose problems | < 2 hours |
| Escalate if needed | As appropriate |

Runbooks

# Example runbook
runbook:
  title: "High Error Rate on Payment Service"
  
  steps:
    - name: "Check Grafana dashboard"
      url: "https://grafana.example.com/d/payment-service"
      
    - name: "Check recent deployments"
      command: "git log --oneline -10"
      
    - name: "Check database connections"
      command: "kubectl exec -it payment-db -- psql -c 'SELECT count(*) FROM pg_stat_activity'"
      
    - name: "Check for circuit breaker events"
      command: "kubectl get cb -A"
      
    - name: "Rollback if needed"
      command: "argocd app rollback payment-service 1"
      
  common_issues:
    - "Database connection pool exhausted - scale up"
    - "Third-party payment provider down - check status page"
    - "Bad deploy - rollback immediately"

Post-Incident Reviews

The Blameless Post-Mortem

# Incident Post-Mortem

## Summary
- **Incident**: Payment service outage
- **Duration**: 45 minutes
- **Impact**: 2,000 failed transactions

## Timeline
- 10:00 - Alert triggered
- 10:05 - On-call acknowledged
- 10:15 - Root cause identified
- 10:35 - Fix deployed
- 10:45 - Service recovered

## Root Cause
Database connection pool exhausted due to connection leak in payment processor library.

## What Went Well
- Alert fired quickly
- Team responded promptly
- Rollback worked correctly

## What Could Be Improved
- Add connection pool monitoring
- Add circuit breaker to payment calls
- Improve test coverage for connection handling

## Action Items
- [ ] Add connection_pool_available SLI
- [ ] Add circuit breaker
- [ ] Add integration test for connection handling

SRE Tools in 2026

Monitoring & Observability

  • Prometheus: Metrics collection and alerting
  • Grafana: Visualization and dashboards
  • Jaeger: Distributed tracing
  • Loki: Log aggregation

Incident Management

  • PagerDuty: On-call and alerting
  • Opsgenie: Alert management
  • VictorOps: Incident response

Service Level Management

  • Sloth: SLO generator
  • Nobl9: SLO platform
  • Datadog SLO: Built-in SLO tracking

Implementing SRE

Step 1: Define Your SLIs

Start with the four golden signals:

  1. Latency
  2. Traffic
  3. Errors
  4. Saturation

Step 2: Set SLO Targets

Start conservative, adjust based on reality:

  • 99% for internal services
  • 99.9% for customer-facing
  • 99.99% for critical

Step 3: Measure and Monitor

Build dashboards showing:

  • Current SLO status
  • Error budget burn rate
  • Historical trends

Step 4: Create Error Budget Policies

Define what happens when SLOs are at risk:

  • Warning thresholds
  • Action items
  • Communication plans

Step 5: Automate Everything

SRE is about removing manual operational work:

  • Automated deployments
  • Automated scaling
  • Automated remediation

Conclusion

SRE provides a framework for building and operating reliable systems. By focusing on SLIs, SLOs, and error budgets, teams can make informed decisions about reliability investments. The key is to start simple, measure what matters, and continuously improve.

In 2026, SRE principles are essential for any team responsible for production systems. The shift from “keeping the lights on” to “engineering reliability” transforms how we think about operations.
