Circuit Breaker Pattern: Preventing Cascade Failures in Distributed Systems

Introduction

In distributed systems, a single failing service can trigger failures that cascade through your entire infrastructure. When a downstream service becomes slow or unavailable, callers that wait for timeouts tie up threads, connections, and memory, which can bring down the whole application. The Circuit Breaker pattern addresses this by acting as a proxy that monitors failures and “trips” to stop requests from overwhelming the failing service.

The Problem: Cascade Failures

┌─────────────────────────────────────────────────────────────────┐
│                    Cascade Failure Scenario                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   User Request                                                  │
│          │                                                      │
│          ▼                                                      │
│   ┌─────────────┐                                               │
│   │  Service A  │                                               │
│   └──────┬──────┘                                               │
│          │ (slow/down)                                          │
│          ▼                                                      │
│   ┌─────────────┐     ┌─────────────┐                           │
│   │  Service B  │────▶│  Service C  │                           │
│   │  (timeout)  │     │  (healthy)  │                           │
│   └──────┬──────┘     └─────────────┘                           │
│          │                                                      │
│          ▼                                                      │
│   Threads blocked waiting for Service B                         │
│   ─────────────────────────────────────                         │
│   Memory exhaustion → More timeouts → Crash                     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
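
A rough back-of-the-envelope calculation shows why blocked threads matter so much. The sketch below assumes a fixed worker pool in which each request holds one worker for the full duration of the downstream call. The pool size and latencies are illustrative numbers, not measurements from a real system.

```python
def max_throughput(pool_size: int, latency_seconds: float) -> float:
    """Upper bound on requests/second when each request holds one worker
    for latency_seconds (Little's law: concurrency = throughput * latency)."""
    if pool_size <= 0 or latency_seconds <= 0:
        raise ValueError("pool_size and latency_seconds must be positive")
    return pool_size / latency_seconds


# Illustrative numbers: 200 workers, a healthy call takes 50 ms,
# a hanging call is only cut off by a 5 s timeout.
healthy = max_throughput(200, 0.05)
degraded = max_throughput(200, 5.0)
print(f"healthy:  {healthy:.0f} req/s")
print(f"degraded: {degraded:.0f} req/s")
```

Under these assumptions the same pool that could serve about 4000 requests per second can serve only 40 once every call waits for the timeout. Failing fast in the caller is what avoids that collapse.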

How the Circuit Breaker Works

The Circuit Breaker acts as a state machine with three states:

┌─────────────────────────────────────────────────────────────────┐
│                     Circuit Breaker States                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│    ┌──────────┐    failure threshold    ┌───────────┐           │
│    │  CLOSED  │────────────────────────▶│   OPEN    │           │
│    │ (normal) │                         │ (blocked) │           │
│    └──────────┘                         └─────┬─────┘           │
│         ▲                                     │                 │
│         │          success threshold          │                 │
│         └─────────────────────────────────────┘                 │
│                       (half-open)                               │
│                                                                 │
│  CLOSED:    Requests pass through normally                      │
│             Failures are counted                                │
│                                                                 │
│  OPEN:      Requests fail immediately                           │
│             Returns fallback response                           │
│                                                                 │
│  HALF-OPEN: Test requests allowed                               │
│             If successful → CLOSED                              │
│             If failed → OPEN again                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
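
The transitions in the diagram can also be written down as a small lookup table, which makes it easy to see that nothing else is allowed to change the state. This is only a sketch: the event names are labels invented here for the arrows, and the OPEN to HALF-OPEN step assumes the usual "wait for a recovery timeout" trigger.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


# Event names are illustrative labels for the arrows in the diagram.
TRANSITIONS = {
    (State.CLOSED, "failure_threshold_reached"): State.OPEN,
    (State.OPEN, "recovery_timeout_elapsed"): State.HALF_OPEN,
    (State.HALF_OPEN, "success_threshold_reached"): State.CLOSED,
    (State.HALF_OPEN, "failure"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    """Return the state after `event`; unknown pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


print(next_state(State.CLOSED, "failure_threshold_reached"))
```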

Implementation Examples

Python Implementation

import time
import threading
from enum import Enum
from typing import Callable, Any, Optional
from functools import wraps

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: int = 60,
        success_threshold: int = 3,
        expected_exception: type = Exception
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.expected_exception = expected_exception
        
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self._lock = threading.Lock()
    
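    def call(self, func: Callable, *args, **kwargs) -> Any:
        # Check and update state under the lock so concurrent callers
        # cannot race on the OPEN -> HALF_OPEN transition.
        with self._lock:
            if self.state == CircuitState.OPEN:
                if self._should_attempt_reset():
                    self._transition_to_half_open()
                else:
                    raise CircuitBreakerOpenError("Circuit breaker is OPEN")
        
        # The protected call runs outside the lock so a slow call does
        # not block other callers.
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except self.expected_exception:
            self._on_failure()
            raise  # re-raise with the original traceback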
    
    def _should_attempt_reset(self) -> bool:
        if self.last_failure_time is None:
            return False
        return (time.time() - self.last_failure_time) >= self.recovery_timeout
    
    def _on_success(self):
        with self._lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self._transition_to_closed()
            else:
                self.failure_count = 0
    
    def _on_failure(self):
        with self._lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            
            if self.state == CircuitState.HALF_OPEN:
                self._transition_to_open()
            elif self.failure_count >= self.failure_threshold:
                self._transition_to_open()
    
    def _transition_to_open(self):
        self.state = CircuitState.OPEN
        self.success_count = 0
    
    def _transition_to_half_open(self):
        self.state = CircuitState.HALF_OPEN
        self.success_count = 0
    
    def _transition_to_closed(self):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None

class CircuitBreakerOpenError(Exception):
    pass

def circuit_breaker(circuit: CircuitBreaker):
    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            return circuit.call(func, *args, **kwargs)
        return wrapper
    return decorator

Using the Circuit Breaker

import requests
from typing import Optional

# Create circuit breaker instance
circuit = CircuitBreaker(
    failure_threshold=3,
    recovery_timeout=30,
    success_threshold=2
)

# Apply to a function
@circuit_breaker(circuit)
def call_external_api(url: str) -> dict:
    response = requests.get(url, timeout=5)
    response.raise_for_status()
    return response.json()

# Usage with fallback
# get_cached_user is a placeholder for your own cache lookup.
def get_user_data(user_id: str) -> Optional[dict]:
    try:
        return call_external_api(f"https://api.example.com/users/{user_id}")
    except (CircuitBreakerOpenError, requests.RequestException):
        # Circuit open or request failed: return cached data or a default response
        return get_cached_user(user_id)
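
To watch the breaker trip and recover without waiting for real timeouts, it helps to inject the clock. The following is a compact, single-threaded sketch written for this purpose, not the class above: `MiniBreaker`, `CircuitOpenError`, and `flaky` are hypothetical names, and the thresholds reuse the values from the usage example (3 failures, 30 seconds, 2 successes).

```python
from typing import Callable


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class MiniBreaker:
    """Single-threaded sketch; `clock` is injectable so no sleeping is needed."""

    def __init__(self, failure_threshold: int, recovery_timeout: float,
                 success_threshold: int, clock: Callable[[], float]):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def call(self, func: Callable, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open")
            self.state, self.successes = "half_open", 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state, self.opened_at = "open", self.clock()
            raise
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state, self.failures = "closed", 0
        else:
            self.failures = 0
        return result


def flaky():
    raise ConnectionError("downstream unavailable")


now = [0.0]                      # fake clock, advanced by hand
breaker = MiniBreaker(3, 30, 2, clock=lambda: now[0])
states = []

for _ in range(3):               # three failures trip the breaker
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass
states.append(breaker.state)

now[0] += 30                     # recovery timeout elapses
breaker.call(lambda: "ok")       # first probe succeeds, still half-open
states.append(breaker.state)
breaker.call(lambda: "ok")       # second probe closes the circuit
states.append(breaker.state)
print(states)
```

The recorded states are open, half_open, closed, which matches the state diagram. The same injectable-clock idea makes unit tests of a production breaker fast and deterministic.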

Java Implementation with Resilience4j

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.vavr.control.Try;  // Try comes from the Vavr library

import java.time.Duration;
import java.util.function.Supplier;

public class CircuitBreakerExample {
    
    public static void main(String[] args) {
        // Configure circuit breaker
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
            .failureRateThreshold(50)                    // Open at 50% failure rate
            .slowCallRateThreshold(100)                  // Open when 100% of calls are slow
            .slowCallDurationThreshold(Duration.ofSeconds(2))  // Calls over 2s count as slow
            .waitDurationInOpenState(Duration.ofSeconds(30))
            .permittedNumberOfCallsInHalfOpenState(3)
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(10)
            .minimumNumberOfCalls(5)
            .build();
        
        CircuitBreakerRegistry registry = CircuitBreakerRegistry.of(config);
        CircuitBreaker circuitBreaker = registry.circuitBreaker("externalService");
        
        // Decorate your function
        Supplier<String> decoratedSupplier = CircuitBreaker
            .decorateSupplier(circuitBreaker, () -> {
                return callExternalService();
            });
        
        // Execute with fallback
        String result = Try.ofSupplier(decoratedSupplier)
            .recover(Exception.class, e -> {
                return getFallbackData();
            })
            .get();
    }
    
    private static String callExternalService() {
        // Your external service call here
        return "Success";
    }
    
    private static String getFallbackData() {
        return "Fallback data";
    }
}

Go Implementation

package circuitbreaker

import (
    "errors"
    "sync"
    "time"
)

type State int

const (
    Closed State = iota
    Open
    HalfOpen
)

type CircuitBreaker struct {
    mu             sync.Mutex
    state          State
    failureCount   int
    successCount   int
    lastFailure    time.Time
    
    // Configuration
    FailureThreshold int
    SuccessThreshold int
    Timeout         time.Duration
}

func New(failureThreshold, successThreshold int, timeout time.Duration) *CircuitBreaker {
    return &CircuitBreaker{
        state:            Closed,
        FailureThreshold: failureThreshold,
        SuccessThreshold: successThreshold,
        Timeout:          timeout,
    }
}

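// ErrOpen is returned when the breaker rejects a call without running it.
var ErrOpen = errors.New("circuit breaker is open")

func (cb *CircuitBreaker) Execute(fn func() error) error {
    // Hold the lock only while inspecting state, not while fn runs,
    // so a slow dependency does not serialize every caller.
    cb.mu.Lock()
    if cb.state == Open {
        if time.Since(cb.lastFailure) > cb.Timeout {
            cb.state = HalfOpen
            cb.successCount = 0
        } else {
            cb.mu.Unlock()
            return ErrOpen
        }
    }
    cb.mu.Unlock()
    
    err := fn()
    
    // Re-acquire the lock to record the outcome.
    cb.mu.Lock()
    defer cb.mu.Unlock()
    
    if err != nil {
        cb.handleFailure()
        return err
    }
    
    cb.handleSuccess()
    return nil
}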

func (cb *CircuitBreaker) handleFailure() {
    cb.failureCount++
    cb.lastFailure = time.Now()
    
    if cb.state == HalfOpen || cb.failureCount >= cb.FailureThreshold {
        cb.state = Open
    }
}

func (cb *CircuitBreaker) handleSuccess() {
    cb.failureCount = 0
    
    if cb.state == HalfOpen {
        cb.successCount++
        if cb.successCount >= cb.SuccessThreshold {
            cb.state = Closed
        }
    }
}

Configuration Best Practices

Setting Thresholds

# Conservative settings for critical services
critical_circuit = CircuitBreaker(
    failure_threshold=3,      # Open after 3 failures
    recovery_timeout=30,      # Try again after 30 seconds
    success_threshold=2       # Need 2 successes to close
)

# Relaxed settings for non-critical services
non_critical_circuit = CircuitBreaker(
    failure_threshold=10,     # Open after 10 failures
    recovery_timeout=120,     # Try again after 2 minutes
    success_threshold=5       # Need 5 successes to close
)
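
One way to keep such settings consistent across a codebase is to name them once and look them up by criticality. The sketch below is an assumption about how that might be organized: `BreakerSettings`, `PROFILES`, and `settings_for` are hypothetical names, and the values simply mirror the two examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerSettings:
    failure_threshold: int
    recovery_timeout: int   # seconds
    success_threshold: int

    def __post_init__(self):
        # Reject settings that could never trip or never recover.
        for name in ("failure_threshold", "recovery_timeout", "success_threshold"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")


# Profile names are hypothetical; the values mirror the two examples above.
PROFILES = {
    "critical": BreakerSettings(3, 30, 2),
    "non_critical": BreakerSettings(10, 120, 5),
}


def settings_for(criticality: str) -> BreakerSettings:
    """Look up a profile, failing loudly on an unknown name."""
    try:
        return PROFILES[criticality]
    except KeyError:
        raise ValueError(f"unknown criticality: {criticality!r}") from None


print(settings_for("critical"))
```

Because the field names match the constructor arguments of the CircuitBreaker class shown earlier, `CircuitBreaker(**dataclasses.asdict(settings_for("critical")))` would build a breaker from a profile.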

Monitoring and Observability

import logging
from datetime import datetime, timezone

class ObservableCircuitBreaker(CircuitBreaker):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.logger = logging.getLogger(__name__)
        self.state_changes = []
    
    def _transition_to_open(self):
        super()._transition_to_open()
        self._log_state_change("OPEN")
        self._alert_on_call()
    
    def _transition_to_closed(self):
        super()._transition_to_closed()
        self._log_state_change("CLOSED")
    
    def _transition_to_half_open(self):
        super()._transition_to_half_open()
        self._log_state_change("HALF_OPEN")
    
    def _log_state_change(self, new_state: str):
        event = {
            # datetime.utcnow() is deprecated since Python 3.12
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "state": new_state,
            "failure_count": self.failure_count
        }
        self.state_changes.append(event)
        self.logger.warning(f"Circuit breaker transitioned to {new_state}")
    
    def _alert_on_call(self):
        # Send alert to monitoring system
        pass
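
A recorded event log like `state_changes` can be turned into a simple health metric, for example the total time a breaker has spent open. The helper below is a sketch that assumes events are in chronological order and shaped like the entries above (an ISO timestamp plus a state name). `seconds_in_state` and the sample log are hypothetical, and the timestamps and `until` must either all carry a UTC offset or all be naive.

```python
from datetime import datetime


def seconds_in_state(events: list, state: str, until: datetime) -> float:
    """Total seconds spent in `state`, given chronologically ordered events.

    Each event lasts until the next event; the last one lasts until `until`.
    """
    total = 0.0
    for current, following in zip(events, events[1:] + [None]):
        if current["state"] != state:
            continue
        start = datetime.fromisoformat(current["timestamp"])
        end = datetime.fromisoformat(following["timestamp"]) if following else until
        total += (end - start).total_seconds()
    return total


# Hand-written sample log (hypothetical timestamps).
log = [
    {"timestamp": "2024-01-01T12:00:00", "state": "OPEN"},
    {"timestamp": "2024-01-01T12:00:30", "state": "HALF_OPEN"},
    {"timestamp": "2024-01-01T12:00:35", "state": "CLOSED"},
    {"timestamp": "2024-01-01T12:10:00", "state": "OPEN"},
]
print(seconds_in_state(log, "OPEN", until=datetime(2024, 1, 1, 12, 10, 45)))
```

In this sample the breaker was open for 30 seconds and then for another 45, so the call reports 75 seconds. A value like this could feed a dashboard or an alert threshold.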

Fallback Strategies

Different Fallback Approaches

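# 1. Return cached data
# Create the breaker once at module level; a breaker created inside the
# function would start fresh on every call and never accumulate failures.
product_circuit = CircuitBreaker(failure_threshold=3, recovery_timeout=60)

def get_product_with_cache(product_id: str) -> dict:
    try:
        return product_circuit.call(get_product_from_api, product_id)
    except CircuitBreakerOpenError:
        return get_product_from_cache(product_id)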

# 2. Return default values
def calculate_discount_with_default(order_id: str) -> float:
    try:
        return call_pricing_service(order_id)
    except (CircuitBreakerOpenError, requests.RequestException):
        return 0.0  # Default: no discount

# 3. Queue for later processing
def process_order_with_queue(order: dict) -> bool:
    try:
        return call_inventory_service(order)
    except (CircuitBreakerOpenError, requests.RequestException):
        queue_order_for_retry(order)
        return True  # Return success, process later

# 4. Degrade gracefully
def get_recommendations_degraded(user_id: str) -> list:
    try:
        return call_ml_service(user_id)
    except (CircuitBreakerOpenError, requests.RequestException):
        # Return popular items instead of personalized
        return get_popular_items()
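
The four approaches share the same try/except shape, so the fallback wiring can be factored into a small decorator. This is one possible sketch rather than part of any library: `with_fallback`, `popular_items`, and `recommendations` are hypothetical names, and built-in exceptions stand in for the breaker and HTTP errors so the example runs on its own.

```python
from functools import wraps
from typing import Callable, Tuple, Type


def with_fallback(fallback: Callable, exceptions: Tuple[Type[BaseException], ...]):
    """Call `fallback` with the same arguments when the wrapped function
    raises one of `exceptions`; any other exception propagates unchanged."""
    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except exceptions:
                return fallback(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical stand-ins for a remote call and its degraded alternative.
def popular_items(user_id: str) -> list:
    return ["bestseller-1", "bestseller-2"]


@with_fallback(popular_items, exceptions=(ConnectionError, TimeoutError))
def recommendations(user_id: str) -> list:
    raise ConnectionError("ML service unavailable")


print(recommendations("user-42"))
```

Listing the exception types explicitly keeps programming errors such as a `TypeError` visible instead of silently masking them with fallback data.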

Common Pitfalls

Pitfall 1: Setting Thresholds Too Low

# BAD: Circuit opens too easily
circuit = CircuitBreaker(failure_threshold=1, recovery_timeout=5)

# GOOD: More resilient configuration
circuit = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

Pitfall 2: No Fallback Strategy

# BAD: No fallback - will still fail
result = call_external_service()  # Raises exception when open

# GOOD: Always has fallback
result = get_data_with_fallback()  # Returns cached/default data

Pitfall 3: Ignoring Slow Calls

# BAD: Only counts exceptions
circuit = CircuitBreaker(failure_threshold=5)

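# GOOD: Also handles slow calls
# Note: slow_call_timeout and slow_call_threshold are illustrative. The
# CircuitBreaker class defined earlier does not accept them and would need
# to be extended to time each call.
circuit = CircuitBreaker(
    failure_threshold=5,
    slow_call_timeout=5,  # Treat slow calls as failures
    slow_call_threshold=3
)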
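
One way to make slow calls count as failures without changing the breaker itself is to wrap the protected function so that an overly slow call raises an exception. The sketch below assumes that approach; `fail_if_slow` and `SlowCallError` are hypothetical names, and the clock is injectable only to keep the behavior easy to test.

```python
import time
from typing import Any, Callable


class SlowCallError(Exception):
    """Raised after a call completes but took longer than allowed."""


def fail_if_slow(func: Callable[..., Any], max_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> Callable[..., Any]:
    """Wrap `func` so that a call slower than `max_seconds` raises SlowCallError.

    The call is not interrupted; it is only measured after the fact, so a
    client-side timeout is still needed to bound how long it can block.
    """
    def wrapper(*args, **kwargs):
        started = clock()
        result = func(*args, **kwargs)
        elapsed = clock() - started
        if elapsed > max_seconds:
            raise SlowCallError(f"call took {elapsed:.2f}s (limit {max_seconds}s)")
        return result
    return wrapper


# Demonstration with a fake clock that reports 3 seconds between readings.
ticks = iter([0.0, 3.0])
slow = fail_if_slow(lambda: "ok", max_seconds=2.0, clock=lambda: next(ticks))
try:
    slow()
except SlowCallError as exc:
    print(f"counted as failure: {exc}")
```

With the CircuitBreaker class shown earlier and its default `expected_exception=Exception`, passing the wrapped function to `circuit.call(...)` would make each `SlowCallError` increment the failure count like any other error.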

Tools and Libraries

Language   Library            Features
Java       Resilience4j       Circuit breaker, retry, rate limiter
.NET       Polly              Circuit breaker, retry, bulkhead
Python     pybreaker          Circuit breaker implementation
Go         gobreaker          Circuit breaker pattern
Node.js    opossum            Circuit breaker for Node.js
Ruby       breaker_machines   Modern circuit breaker

AWS Lambda Considerations

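For serverless architectures, keep in mind that in-memory circuit breaker state lives only within a single Lambda execution environment and is not shared across concurrent instances. The handler should therefore always be ready to fall back:

import logging

logger = logging.getLogger(__name__)

# AWS Lambda handler with a circuit breaker fallback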
def lambda_handler(event, context):
    try:
        # Attempt the operation
        result = call_external_service()
        return {"statusCode": 200, "body": result}
    except CircuitBreakerOpenError:
        # Return cached or default response
        return {"statusCode": 200, "body": get_fallback_data()}
    except Exception as e:
        # Log and return error
        logger.error(f"Error: {str(e)}")
        return {"statusCode": 500, "body": "Internal error"}

Conclusion

The Circuit Breaker pattern is essential for building resilient distributed systems. By preventing cascade failures and providing graceful degradation, it helps maintain system availability even when dependencies fail. Key takeaways:

  1. Set appropriate thresholds based on service criticality
  2. Always implement fallbacks to provide degraded functionality
  3. Monitor circuit breaker state to detect issues early
  4. Use with other patterns like retry, timeout, and bulkhead for comprehensive resilience
  5. Configure recovery properly to allow services to recover

Implement circuit breakers at service boundaries, especially for external API calls and downstream dependencies. This pattern, combined with proper monitoring and fallback strategies, forms the foundation of resilient microservices architecture.
